Automatic transformation of lecture transcription into document style using statistical framework
نویسندگان
چکیده
This paper addresses automatic transformation from spoken style texts to written style texts. Exact transcriptions and speech recognition results of live lectures include many spoken language expressions, and thus, are not suitable for documents and need to be edited. In this paper, we present a method of applying of the statistical approach used in machine translation to this post-processing task. Specifically, we implement the correction of colloquial expressions, the deletion of fillers, the insertion of periods, and the insertion of particles in an integrated manner. A preliminary evaluation confirms that the statistical transformation framework works well and we achieved high recall and precision rate of period and particle insertion.
منابع مشابه
Automatic Transcription of Lecture Speech using Language Model Based on Speaking-Style Transformation of Proceeding Texts
For language modeling of spontaneous speech recognition, we propose a style transformation approach, which transforms written texts to a spoken-style language model. Since these two styles are largely different and thus direct transformation is difficult, we cascade two transformation methods; rule-based transformation to rewrite written-style texts to intermediate “verbatim” texts, and statist...
متن کاملA monotonic statistical machine translation approach to speaking style transformation
This paper presents a method for automatically transforming faithful transcripts or ASR results into clean transcripts for human consumption using a framework we label speaking style transformation (SST). We perform a detailed analysis of the types of corrections performed by human stenographers when creating clean transcripts, and propose a model that is able to handle the majority of the most...
متن کاملThe ISL Baseline Lecture Transcription System for the TED Corpus
This paper describes the Interactive Systems Laboratories’ automatic lecture transcription system for the Translanguage English Database (TED) corpus, which provides text-hypothesis for the International Workshop on Speech Summarization for Information Extraction and Machine Translation. Furthermore the paper gives a short analysis of speaking style characteristics, in particular addressing nat...
متن کاملTowards Automatic Transformation between Different Transcription Conventions: Prediction of Intonation Markers from Linguistic and Acoustic Features
Because of the tremendous effort required for recording and transcription, large-scale spoken language corpora have been hardly developed in Japanese, with a notable exception of the Corpus of Spontaneous Japanese (CSJ). Various research groups have individually developed conversation corpora in Japanese, but these corpora are transcribed by different conventions and have few annotations in com...
متن کاملProvide a model for the establishment of the school in accordance with the indicators and requirements of the Education Transformation Document
Purpose: The aim of this study was to provide a model for school establishment in accordance with the indicators and requirements of the Education Transformation Document. Methodology: The research method was basic-applied in terms of purpose, descriptive-survey in terms of data collection method and combined in terms of data type. The statistical population of the study in the qualitative sect...
متن کامل